Organelle assembly -- plant mitochondria from second- and third-generation data -- GSAT - Graph-based Sequence Assembly Toolkit

Cite

0. Project usage

cp /share/nas2/yuj/project/2024/plant_mt/GP-20240318-8017_20240412/data/gsat.conf gsat.cfg

1. Input: paired-end reads; output path “01graphShort”
gsat graphShort -conf gsat.cfg

2.??? Input: third-generation reads “map_gene.fa” and graph file “og.filtered.gfa”; output path “02graphLong”
gsat graphLong -conf gsat.cfg

2.???
gsat graphMap -a on -r /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/3dai/map_gene.fa -g 01graphShort/og.filtered.gfa -o 002graphMap -d on -minimap2 ont



1. Introduction

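GSAT is a graph-based sequence assembly toolkit that provides a set of commands and options for processing and analyzing assembly graphs.
It is designed to assemble plant organelle genomes into a simple and accurate master graph, and it bundles several graph-based tools for working with genome assembly results and high-throughput sequencing data.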

2. Installation

# Option 1: install with git
git clone https://github.com/hwc2021/GSAT.git
cd GSAT/bin
chmod a+x gsat

# Option 2: install from the downloaded source archive
# Put the archive "GSAT-main.zip" in the directory where you want to install GSAT
unzip GSAT-main.zip
cd GSAT-main/bin
chmod a+x gsat
vi ~/.bashrc
# Add the next line to the end of ~/.bashrc (remove the leading "#" when pasting it into the file)
#export PATH=$PATH:/your/path/GSAT/bin
source ~/.bashrc
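
After editing ~/.bashrc it can be worth confirming that the shell really resolves gsat. The following is a minimal sketch of such a check. The helper names dir_on_path and check_gsat are illustrative and not part of GSAT, and /your/path/GSAT/bin is the same placeholder used above.

```shell
# Return success (0) if directory $1 is one of the entries in PATH.
# dir_on_path is a hypothetical helper, not a GSAT command.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the shell can find an executable named gsat.
check_gsat() {
    if command -v gsat >/dev/null 2>&1; then
        echo "gsat found at: $(command -v gsat)"
    else
        echo "gsat not found on PATH"
    fi
}

# Example (not run automatically):
#   dir_on_path /your/path/GSAT/bin && echo "bin directory is on PATH"
#   check_gsat
```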

3. Usage

3.1 Main program

gsat <command> [options]

Commands:
-- Functions
   graphFilt            filter the assembly graph with different params
   graphMap             conduct graph mapping to detect mapped paths in a graph for query sequence
   graphCorr            correct the sequences in a graph by using long reads. HiFi reads are recommended.
   graphSimplify        simplify the graph based on supported mapped paths of long reads.
   rmOverlap            remove the overlapping regions from a graph

-- Pipelines
   graphShort           generate an Organelle Graph (OG) from a raw graph of de novo assembly
   graphLong            generate a Mitochondrial Rough Graph (MRG) from an OG
   graphSimplification  generate a Mitochondrial Rough Master Graph (MRMG) from an MRG
   graphCorrection      generate a Mitochondrial Master Graph from an MRMG

-- Information
   help                 print a brief help information
   man                  print a complete help document
   version              print the version information

3.2 Pipelines

3.2.1 graphShort

gsat graphShort -conf gsat.cfg

3.2.2 graphLong

gsat graphLong -conf gsat.cfg

3.2.3 graphSimplification

gsat graphSimplification -conf gsat.cfg

3.2.4 graphCorrection

gsat graphCorrection -conf gsat.cfg
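
According to the command descriptions in 3.1, each pipeline consumes the graph produced by the previous one, so the four steps run in a fixed order. A minimal sketch of a wrapper that keeps that order and stops at the first failure is shown here. The names run_gsat_pipelines, GSAT_BIN and DRY_RUN are illustrative assumptions, not GSAT options. Because options such as out and gfaFile probably need to change between steps, the sketch accepts one config file per step and reuses the last one given when fewer than four are supplied.

```shell
# Pipelines in the order listed in section 3.2.
GSAT_PIPELINES="graphShort graphLong graphSimplification graphCorrection"

# Run the pipelines in order, stopping at the first failure.
# Arguments: one config file per step; the last one given is reused
# for any remaining steps. With DRY_RUN=1 commands are only printed.
# run_gsat_pipelines, GSAT_BIN and DRY_RUN are hypothetical names.
run_gsat_pipelines() {
    [ "$#" -gt 0 ] || { echo "usage: run_gsat_pipelines CFG [CFG...]" >&2; return 2; }
    cfg=""
    for step in $GSAT_PIPELINES; do
        # Take the next config if one is left; otherwise keep the previous one.
        if [ "$#" -gt 0 ]; then
            cfg="$1"
            shift
        fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "${GSAT_BIN:-gsat} $step -conf $cfg"
        else
            "${GSAT_BIN:-gsat}" "$step" -conf "$cfg" || {
                echo "pipeline $step failed" >&2
                return 1
            }
        fi
    done
}

# Example (not run automatically):
#   DRY_RUN=1; run_gsat_pipelines gsat.cfg
```

Printing the commands first with DRY_RUN=1 is a cheap way to confirm which config file each step will receive before starting a long run.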

3.3 Functions

3.3.1 graphFilt

gsat graphFilt

3.3.2 graphMap

gsat graphMap -a on -r long_reads.fa -g og.filtered.gfa -o 002graphMap -d on -minimap2 ont
graphMap:
        Usage:   gsat graphMap [options]

        -align|-a                 Conduct graph mapping of the reads against the graph (requires -r and -g). [off]*
        -readFile|-r [str]        A PacBio / Nanopore read file in FASTA format. NOT available if -a is off.*
        -gfaFile|-g [str]         Path to og.filtered.gfa.*


        -blast7File|-b [str]      Calculate the mapped paths from a blastn result file. NOT available if -a|-p is applied.*
        -pafFile|-p [str]         Calculate the mapped paths from a minimap2 result file. NOT available if -a|-b is applied.*


        -minRead [int]            The min length (bp) of selected reads. [1000]
        -maxOffset1 [int]         The max offset between the ends of contigs in alignments, regarding the overlaps of contigs. [10]
                                  The real range of offset is from 1-K-offset to 1-K+offset. Not compatible with --maxOffset2.
        -maxOffset2 [int]         The max offset between the ends of contigs in alignments, ignoring the overlaps of contigs. [off]
                                  The real range of offset is from 0-offset to 0+offset. Not compatible with --maxOffset1.
        -maxCombDis [int]         The max distances allowed for combining two alignments. [15]
        -maxEdgeSize1 [int]       The max gap size allowed for the alignment at the edge of reads. [60]
        -maxEdgeSize2 [int]       The max gap size allowed for the alignment at the edge of contigs. [10]
        -maxBounderRatio [float]  The max ratio allowed for the bounder size which covered the full length of a contig. [0.1]
        -maxIdenGap [float]       The max identity difference allowed for retaining an alternative alignment (path)
                                  when compared with the identity of the best alignment (path). [1]
                                  Caution: this is still a beta method and is not recommended for use yet.
        -minIden [float]          The min identity required to use an alignment (in b7 and paf files). [0.85]
        -minCovofRead [float]     The min alignment coverage required to use a read (in b7 and paf files). [0.9]
        -minCovbyPath [float]     The min coverage of the read required to output a path. [0.9]


        -out|-o [str]             The name prefix of output files.
                                  Note: this is a file-name prefix, not a directory.
        -strictBub                Retain bubbles only when all members are mapped to the read with exactly the same
                                  start and end positions. [on]
        -depth|-d                 Calculate the depth of contigs from the passed reads. [off]
        -calDepth|-cd             Calculate the depth directly from previous results (requires -o and -g), as -d does. [off]
                                  Note: -g should be given mrg.filtered.gfa here.
        -filterPaths|-f           Further filter the previous results when -cd is applied. [off]
                                  However, only -minRead and -minCovbyPath are currently available.
        -minimap2 [str]           Use minimap2 instead of blastn to map reads to long contigs. [off]
                                  The read type should be specified here, such as hifi, clr, or ont.

        Note: * denotes a required option.
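
The help text for -maxOffset1 and -maxOffset2 gives the accepted offset range only as a formula. As a worked example, the sketch below evaluates both formulas. It assumes that K in the help text is the overlap length between adjacent contigs in the graph (often the assembler's k-mer size); the source does not state this explicitly. The function names are illustrative.

```shell
# Range for -maxOffset1: from 1-K-offset to 1-K+offset.
# K is ASSUMED to be the overlap length between adjacent contigs.
offset_range1() {
    k="$1"
    off="$2"
    echo "$((1 - k - off)) $((1 - k + off))"
}

# Range for -maxOffset2: from 0-offset to 0+offset (overlaps ignored).
offset_range2() {
    off="$1"
    echo "$((0 - off)) $((0 + off))"
}

# Example (not run automatically):
#   offset_range1 77 10    # prints: -86 -66
#   offset_range2 10       # prints: -10 10
```

With an overlap of 77 and the default -maxOffset1 of 10, the accepted offsets would run from -86 to -66 under this reading, which is why the two options describe different coordinate conventions and are not meant to be combined.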

3.3.3 graphCorr

gsat graphCorr

3.3.4 graphSimplify

gsat graphSimplify

3.3.5 rmOverlap

gsat rmOverlap

3.4 Configuration file

/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/gsat.cfg

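# This is an example configuration file for running the pipeline commands.
# * marks an option that is required by the corresponding pipeline.
# Options for different pipelines can be kept in the same file, because each pipeline ignores options that do not apply to it when reading this file.

# Global options
out			02graphLong	# Prefix of the output directory.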


# Options for the graphShort pipeline
r1			/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/map_pair_hits.1.fq # First reads file of the paired-end Illumina sequencing data.*
r2			/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/map_pair_hits.2.fq # Second reads file of the paired-end Illumina sequencing data.*
maxReadLen	127 # Maximum read length in the reads files.*

minDep1		10 # Minimum depth for retaining contigs longer than 500 bp.*
minDep2		20 # Minimum depth for retaining contigs longer than 1000 bp.*
rmSep       off # [on/off] Whether to remove isolated contigs that have no links to other contigs.


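# Options for the graphLong pipeline
rmBubbPt	off # [on/off] Remove pt-like contigs from bubbles. Use this option with care.


# Options for the graphLong and graphSimplification pipelines
minPathNo	3	# Minimum number of supporting paths required to keep a link.*
minEnd		75	# End contigs with a mapped length shorter than this value are filtered out.*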


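# Options for the graphLong, graphSimplification and graphCorrection pipelines
readFile	/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/3dai/map_gene.fa # PacBio/Nanopore reads file in FASTA format.*
gfaFile		/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/01graphShort/og.filtered.gfa	# Input assembly graph.*
minRead		1000 # Minimum length (bp) of selected reads.
maxOffset1	10 # Max offset between the ends of contigs in alignments. The real range of offset is from 1-K-offset to 1-K+offset. Not compatible with --maxOffset2.
#maxOffset2	10  # Max offset between the ends of contigs in alignments, ignoring the overlaps of contigs. The real range of offset is from 0-offset to 0+offset. Not compatible with --maxOffset1.
maxCombDis	15 # Max distance allowed for combining two alignments.
maxEdgeSize1	60 # Max gap size allowed for the alignment at the edge of reads.
maxEdgeSize2	10 # Max gap size allowed for the alignment at the edge of contigs.
maxBounderRatio	0.1 # Max ratio allowed for the bounder size that covers the full length of a contig. [0.1]
maxIdenGap	1 # Max identity difference, relative to the best alignment (path), allowed for retaining an alternative alignment (path). Caution: this is still a beta method and is not recommended.
minIden		0.85 # Min identity required to use an alignment.
minCovofRead	0.9 # Min alignment coverage required to use a read.
minCovbyPath	0.9 # Min coverage of the read required to output a path.
strictBub		on # [on/off] Retain bubbles only when all members are mapped to the read with exactly the same start and end positions.
depth			on # [on/off] Calculate the depth of contigs from the passed reads.
minimap2		ont # [hifi/clr/ont/off] Use minimap2 instead of blastn to map reads to long contigs. The read type should be specified here, such as hifi, clr, or ont.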



# Options for the graphCorrection pipeline
minReadProp		0.6 # Minimum proportion of supporting reads required to confirm a base correction.
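
The file above uses one option per line: a key, whitespace, a value, and an optional "#" comment. The following sketch reads a value back and checks that a pipeline's required options are set before a long run. It reflects only the layout shown here, not GSAT's own parser, and it assumes values contain no spaces. The helper names cfg_get and cfg_require are hypothetical.

```shell
# Print the value of option $1 from config file $2 (empty if not set).
# Lines starting with "#" (including commented-out options) are skipped.
# Assumes the value is a single whitespace-free token.
cfg_get() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }
        $1 == key  { v = $2; sub(/#.*/, "", v); print v; exit }
    ' "$2"
}

# Check that every option named after the config file $1 has a value.
# Prints each missing option to stderr and returns 1 if any is missing.
cfg_require() {
    file="$1"
    shift
    missing=0
    for key in "$@"; do
        if [ -z "$(cfg_get "$key" "$file")" ]; then
            echo "missing required option: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run automatically), using the starred graphShort options:
#   cfg_require gsat.cfg r1 r2 maxReadLen minDep1 minDep2
```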